Setup

Welcome! While we’re waiting:

Introduction

Outline

US Census Data

The “nation’s leading provider of quality data about its people and economy.”

Available at www.census.gov

Primary Census Products

Decennial Census

Complete count of the population every 10 years since 1790

Includes data on

American Community Survey (ACS)

Decennial Census* vs ACS Data

Demographic* Social Economic Housing
Sex Families Income Tenure*
Age Education Benefits Occupancy*
Race Marital Status Employment Status Structure Type
Hispanic Origin Fertility Occupation Housing Value
Grandparents Industry Taxes & Insurance
Veterans Commuting Utilities
Disability Status Place of Work Mortgage
Language at Home Health Insurance Monthly Rent
Citizenship
Mobility

Census Geographies

Census data are publicly available at one or more levels of geographic aggregation.

Census Data & Census Geographies

Census Data Workflow

Identify your - topic of interest - year(s) - geographic level of detail - for what locations?

Then determine what specific tables and variables are available - ACS or Decennial?

CAUTION

“If you want to measure change you can’t change the measures!”

Census tables, variables, geographies, and geographic boundaries change over time!

Measuring change over time with census data is it’s own thing, complex and not covered by this workshop!

Packages for Working with Census Data

These are the ones we recommend and will use today.

tidycensus & tigris

tidycensus

Functions for accessing census decennial and ACS 5 year datasets via Census APIs

  • only a subset of datasets / years available
  • requires a Census API key

tidycensus

Limited set of years available via tidycensus

  • decennial census: 1990, 2000, and 2010
  • ACS 5 yr: 2005-2010 through 2012-2016 are available.
  • Note: tidycensus referes to ACS 5year datasets by the endyear.

Most recent ACS available is 2012-2016

  • 2013 - 2017 released Dec 6, 2018 by the Census. Need to check when available in tidycensus & update package.

tigris

Provides access to census geographic data files

- detailed TIGER/Line boundary files (e.g., shapefiles), or
- simplified Cartographic boundary files

Also provides access to census feature data, - eg, rivers, roads, coastlands, landmarks, and more

Used by tidycensus to access state, county, tract, block group, block, and ZCTA boundaries.

  • Use tigris directly to access other census geographic data.

tidycensus & tigris

Packages developed by Kyle Walker to make it easier to fetch data from Census websites and APIs in R and get that data in a useable format to analyze, plot, and map.

Check out his website to keep abreast of his great packages, blog posts and tutorials.

Walker also develped a new DataCamp course Analyzing US Census Data in R!

  • Highly recommended! First chapter free!

Alternatives

You can write code to access the Census APIs directly.

You can download Census data directly from:

You can download Census geographic data directly on the census website

tidyverse

A collection of R Packages for data science - developed primarily by Hadley Wickham, Chief Scientist at RStudio.

  • dplyr and tidyr for reshaping data

  • ggplot2 for plotting

  • purr, readr and tibble for improved performance

These packages are used by tidyverse under the hood. ## sf

Simple features for geospatial data objects and methods.

  • Next generation R package for working with vector geospatial data
    • will soon supercede the sp package

sf includes the functionality of the sp, rgdal, rgeos and proj4 packages.

  • but with improved performance, simplified command syntax and easier workflows.

Tutorial Time!

Part 1.

We will work through several exercises using tidycensus to fetch, wrangle and map census data.

Loading packages

Load the packages we will use today

library(tidycensus)
library(tidyverse) 
library(tigris)
library(sf)
library(dplyr)

Install any packages that you do not have on your computer

Also install any dependancies.

# install.packages("tidyverse")
# install.packages("tidycensus")
# install.packages("sf")

Census API Key

You need a census API key to programmatically fetch census data.

For more info see:

# Install your census api key. 
# If you set install=TRUE you need only do this once.

#census_api_key("YOUR KEY GOES HERE", install = TRUE)

my_census_api_key <- "f2d6f4f743545d3a42a67412b05935dc7712c432"
census_api_key(my_census_api_key)
## To install your API key for use in future sessions, run this function with `install = TRUE`.

Set working directory

to the folder in which you unzipped the workshop files, e.g.,

setwd("~/Documents/Dlab/workshops/2018/rCensus_workshop")

Fetching Data from the Decennial Census

Population Data

Let’s start by fetching population data for each state from the 2010 Census

Identify Tables & Variables

First step is to identify the names of the census variables that contain the data of interest.

The tidycensus load_variables function can help with that.

First, take a look at the function documentation.

?load_variables

load_variables

Use load_variables to fetch all variables used in the 2010 census into a dataframe.

vars2010 <- load_variables(year=2010,        # Year or end year for ACS
                           dataset = 'sf1',  # 'sf1' for decennial or 'acs5'
                           cache = TRUE)     # Whether to save fetched data locally

Decennial Census Variables

Let’s take a look at and discuss the resultant dataframe.

  • How many 2010 census variables are in the table?
View(vars2010)

2010 Decennial Census Tables

  • Topics: Population, housing

  • Tables: currenty 333
    • 177 population tables (identified with a ‘‘P’’) available to the block level
    • 58 housing tables (identified with an ‘‘H’’) available to the block level
    • 82 population tables (identified with a ‘‘PCT’’) available to the census tract level
    • 4 housing tables (identified with an “HCT”) available to the census tract level
    • 10 population tables (identified with a “PCO”) available to the county level
    • plus 2 additoinal PCT tables

get_decennial

get_decenial is the tidycensus function for fetching data from the decennial census API.

First, check the documentation for the function.

?get_decennial

get_decennial

Let’s fetch total population by state (P001001) from the 2010 census using get_decennial.

pop2010 <- get_decennial(geography = "state",   # census tabulation unit
                         variables = "P001001", # variable(s) of interest
                         year = 2010)           # census year
## Getting data from the 2010 decennial Census

View the Data

  • How many rows and columns?

  • What column has the pop counts?

pop2010
## # A tibble: 52 x 4
##    GEOID NAME                 variable     value
##    <chr> <chr>                <chr>        <dbl>
##  1 01    Alabama              P001001   4779736.
##  2 02    Alaska               P001001    710231.
##  3 04    Arizona              P001001   6392017.
##  4 05    Arkansas             P001001   2915918.
##  5 06    California           P001001  37253956.
##  6 08    Colorado             P001001   5029196.
##  7 09    Connecticut          P001001   3574097.
##  8 10    Delaware             P001001    897934.
##  9 11    District of Columbia P001001    601723.
## 10 12    Florida              P001001  18801310.
## # ... with 42 more rows

Visualize results

ggplot2 is the most commonly used R package for data visualization.

It is loaded when you load the tidyverse package.

  • Knowlege of ggplot very useful for both exploratory analysis and communicating results.
pop_plot<- ggplot(data=pop2010, aes(x=reorder(NAME,value), y=value/1000000)) + 
  geom_bar(stat="identity") + coord_flip() +
  theme_minimal() + 
  labs(title = "2010 US Population by State") +
  xlab("State") +
  ylab("in millions")

Display the plot

Challenge

Fetch population data by state for 2000.

Don’t assume variable names are the same across years. Check first!

Total Population in 2000

# What is the variable name in 2000?
vars2000 <- load_variables(year=2000, dataset = 'sf1', cache = T)

# Take a look and search in the dataframe
View(vars2000)

# Fetch the 2000 pop data
pop2000 <- get_decennial(geography = "state", variables = "P001001", year = 2000)

# Take a look
pop2000

Limiting by Area of Interest

In the previous example we retrieved population data for all states.

  • This is the default behavior if you don’t specify a subset.

  • You can limit the retrieved data by state or by county, if applicable.

Limit Areas of Interest

Let’s fetch data for just 3 states.

state_pop2010 <- get_decennial(geography = "state", # census tabulation unit
                         variables = "P001001",     # variables of interest
                         year = 2010,               # census year
                         state=c("CA","OR","WA"))   # Filter by states of interest
## Getting data from the 2010 decennial Census

Review Results

state_pop2010
## # A tibble: 3 x 4
##   GEOID NAME       variable     value
##   <chr> <chr>      <chr>        <dbl>
## 1 06    California P001001  37253956.
## 2 41    Oregon     P001001   3831074.
## 3 53    Washington P001001   6724540.

Changing Census Tabulation unit

get_decennial accepts a number of different values for tabulation unit.

  • Options include: state, county, tract, block group, block, and ZCTA.

Let’s change the tabulation unit to county.

co_pop2010 <- get_decennial(geography = "county",   # census tabulation unit
                            variables = "P001001",  # variables of interest
                            year = 2010)
## Getting data from the 2010 decennial Census

Changing Census Tabulation unit

View the Data to see what was retrieved.

str(co_pop2010)
## Classes 'tbl_df', 'tbl' and 'data.frame':    3221 obs. of  4 variables:
##  $ GEOID   : chr  "01001" "01007" "01017" "01019" ...
##  $ NAME    : chr  "Autauga County, Alabama" "Bibb County, Alabama" "Chambers County, Alabama" "Cherokee County, Alabama" ...
##  $ variable: chr  "P001001" "P001001" "P001001" "P001001" ...
##  $ value   : num  54571 22915 34215 25989 25833 ...
head(co_pop2010, 2)
## # A tibble: 2 x 4
##   GEOID NAME                    variable  value
##   <chr> <chr>                   <chr>     <dbl>
## 1 01001 Autauga County, Alabama P001001  54571.
## 2 01007 Bibb County, Alabama    P001001  22915.
tail(co_pop2010, 2)
## # A tibble: 2 x 4
##   GEOID NAME                           variable  value
##   <chr> <chr>                          <chr>     <dbl>
## 1 72151 Yabucoa Municipio, Puerto Rico P001001  37941.
## 2 72153 Yauco Municipio, Puerto Rico   P001001  42043.

Challenge

  1. Fetch population by county for just California

  2. Fetch population by county for Oregon & California

Challenge

  1. Fetch population by tract for all states.

  2. Fetch population by tract for California.

Fetching Census Tract Data

If you want census data at the tract level or below you need to specifiy the state & county or counties.

tract_pop2010 <- get_decennial(geography = "tract",   # census tabulation unit
                         variables = "P001001",       # variable of interest
                         year = 2010,                 # census year
                         state="CA",                  # limit to state of California
                         county=c("Alameda","Contra Costa"))  # and only these counties
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census

Fetching Census Tract Data

View the results! How many census tracts are in these 3 counties?

str(tract_pop2010)
## Classes 'tbl_df', 'tbl' and 'data.frame':    569 obs. of  4 variables:
##  $ GEOID   : chr  "06001400100" "06001400200" "06001400300" "06001400400" ...
##  $ NAME    : chr  "Census Tract 4001, Alameda County, California" "Census Tract 4002, Alameda County, California" "Census Tract 4003, Alameda County, California" "Census Tract 4004, Alameda County, California" ...
##  $ variable: chr  "P001001" "P001001" "P001001" "P001001" ...
##  $ value   : num  2937 1974 4865 3703 3517 ...
head(tract_pop2010, 2)
## # A tibble: 2 x 4
##   GEOID       NAME                                          variable value
##   <chr>       <chr>                                         <chr>    <dbl>
## 1 06001400100 Census Tract 4001, Alameda County, California P001001  2937.
## 2 06001400200 Census Tract 4002, Alameda County, California P001001  1974.
tail(tract_pop2010, 2)
## # A tibble: 2 x 4
##   GEOID       NAME                                          variable value
##   <chr>       <chr>                                         <chr>    <dbl>
## 1 06013392300 Census Tract 3923, Contra Costa County, Cali… P001001  3102.
## 2 06013990000 Census Tract 9900, Contra Costa County, Cali… P001001     0.

Challenge

  1. Fetch population by county for Alameda County, California

  2. Fetch population by tract for the nine county Bay Area:
  • Alameda, SF, Contra Costa, Marin County, Napa,
  • San Mateo, Santa Clara, Solano, Sonoma, Santa Cruz

Challenge hint

You can use names or FIPs codes as your state and county function arguments.

# County FIPs Codes for
# Alameda, SF, Contra Costa, Marin County, Napa, 
# San Mateo, Santa Clara,  Solano,  Sonoma, santa cruz
nine_counties <- c("001", "075", "013", "041", "055", "081", "085", "095", "097", "087")

# Your code below...
bayarea_pop10 <-...

Fetching data for more than one census variable

What three things are new here?

ur_pop10 <- get_decennial(geography = "county",  # census tabulation unit
                           variables = c(urban="P002002",rural="P002005"),
                           year = 2010, 
                           summary_var = "P002001",  # Total Urban - the denominator
                           state='CA',
                           county=c("Napa","Sonoma","Mendocino"))
## Getting data from the 2010 decennial Census

Take a look at the results

ur_pop10
## # A tibble: 6 x 5
##   GEOID NAME                         variable   value summary_value
##   <chr> <chr>                        <chr>      <dbl>         <dbl>
## 1 06045 Mendocino County, California urban     48110.        87841.
## 2 06055 Napa County, California      urban    118194.       136484.
## 3 06097 Sonoma County, California    urban    424102.       483878.
## 4 06045 Mendocino County, California rural     39731.        87841.
## 5 06055 Napa County, California      rural     18290.       136484.
## 6 06097 Sonoma County, California    rural     59776.       483878.

Calculating Percents

The summary_value column comes in handy when you want to compute percent of total.

Here’s one way to do it.

ur_pop10 <- ur_pop10 %>%
            mutate(pct = 100 * (value / summary_value))
## Warning: package 'bindrcpp' was built under R version 3.4.4

Calculating Percents

Let’s take a look at the output

ur_pop10 # Take a look
## # A tibble: 6 x 6
##   GEOID NAME                         variable   value summary_value   pct
##   <chr> <chr>                        <chr>      <dbl>         <dbl> <dbl>
## 1 06045 Mendocino County, California urban     48110.        87841.  54.8
## 2 06055 Napa County, California      urban    118194.       136484.  86.6
## 3 06097 Sonoma County, California    urban    424102.       483878.  87.6
## 4 06045 Mendocino County, California rural     39731.        87841.  45.2
## 5 06055 Napa County, California      rural     18290.       136484.  13.4
## 6 06097 Sonoma County, California    rural     59776.       483878.  12.4

Plot it

Plots give us compact visual summaries of the data

myplot <- ggplot(data = ur_pop10, 
          mapping = aes(x = NAME, fill = variable, 
                     y = ifelse(test = variable == "urban", 
                                yes = -pct, no = pct))) +
          geom_bar(stat = "identity") +
          scale_y_continuous(labels = abs, limits=c(-100,100)) +
          labs(title="Urban & Rural Population in Wine Country", 
               x="County", y = " Percent of Population", fill="") +
          coord_flip()

Plot it

myplot

Fetch all the data in one table

This is often helpful but you need to keep tract of the meaning of each variable.

alco_pop10 <- get_decennial(geography = "tract", # Census tabulation unit
                           table =  "P002",      # Table of urban & rural population counts
                           year = 2010,          # Decennial census year
                           state='CA',           # Filter state
                           county="Alameda")     # Filter county
## Getting data from the 2010 decennial Census

Take a look

unique(alco_pop10$variable)
## [1] "P002001" "P002002" "P002003" "P002004" "P002005" "P002006"
alco_pop10
## # A tibble: 2,166 x 4
##    GEOID       NAME                                         variable value
##    <chr>       <chr>                                        <chr>    <dbl>
##  1 06001400100 Census Tract 4001, Alameda County, Californ… P002001  2937.
##  2 06001400200 Census Tract 4002, Alameda County, Californ… P002001  1974.
##  3 06001400300 Census Tract 4003, Alameda County, Californ… P002001  4865.
##  4 06001400400 Census Tract 4004, Alameda County, Californ… P002001  3703.
##  5 06001400500 Census Tract 4005, Alameda County, Californ… P002001  3517.
##  6 06001400600 Census Tract 4006, Alameda County, Californ… P002001  1571.
##  7 06001400700 Census Tract 4007, Alameda County, Californ… P002001  4206.
##  8 06001400800 Census Tract 4008, Alameda County, Californ… P002001  3594.
##  9 06001400900 Census Tract 4009, Alameda County, Californ… P002001  2302.
## 10 06001401000 Census Tract 4010, Alameda County, Californ… P002001  5678.
## # ... with 2,156 more rows

Fetching Data for multiple years

You cannot do this with one tidycensus function call unless you wrap it in a function.

  • See Extra section at the end of tutorial.
pop2000 <- get_decennial(geography = "state", variables = "P001001", year = 2000)
pop2010 <- get_decennial(geography = "state", variables = "P001001", year = 2010)

Output options

Let’s try all three of these commands and then look at the ouput to see what’s different?

pop2010a <- get_decennial(geography = "state", variables = "P001001", year = 2010)
pop2010b <- get_decennial(geography = "state", variables = c(pop10="P001001"), year = 2010)
pop2010c <- get_decennial(geography = "state", variables = c(pop00="P001001"), year = 2010, output="wide")

Data Wrangling

Your R skills can help you reformat the data and make it more useable.

First, fetch population data for 2010 & 2000 by state with output=wide.

Label the variables pop00 and pop10.

Data Wrangling

pop2000 <- get_decennial(geography = "state", variables = c(pop00="P001001"), year = 2000, output="wide")
## Getting data from the 2000 decennial Census
pop2010 <- get_decennial(geography = "state", variables = c(pop10="P001001"), year = 2010, output="wide")
## Getting data from the 2010 decennial Census

Merge population by state from both censuses

pop2000_2010 <- pop2000 %>% merge(pop2010, by="NAME") %>%
                             select(NAME, pop00, pop10)

Data Wrangling

View results

pop2000_2010
##                    NAME    pop00    pop10
## 1               Alabama  4447100  4779736
## 2                Alaska   626932   710231
## 3               Arizona  5130632  6392017
## 4              Arkansas  2673400  2915918
## 5            California 33871648 37253956
## 6              Colorado  4301261  5029196
## 7           Connecticut  3405565  3574097
## 8              Delaware   783600   897934
## 9  District of Columbia   572059   601723
## 10              Florida 15982378 18801310
## 11              Georgia  8186453  9687653
## 12               Hawaii  1211537  1360301
## 13                Idaho  1293953  1567582
## 14             Illinois 12419293 12830632
## 15              Indiana  6080485  6483802
## 16                 Iowa  2926324  3046355
## 17               Kansas  2688418  2853118
## 18             Kentucky  4041769  4339367
## 19            Louisiana  4468976  4533372
## 20                Maine  1274923  1328361
## 21             Maryland  5296486  5773552
## 22        Massachusetts  6349097  6547629
## 23             Michigan  9938444  9883640
## 24            Minnesota  4919479  5303925
## 25          Mississippi  2844658  2967297
## 26             Missouri  5595211  5988927
## 27              Montana   902195   989415
## 28             Nebraska  1711263  1826341
## 29               Nevada  1998257  2700551
## 30        New Hampshire  1235786  1316470
## 31           New Jersey  8414350  8791894
## 32           New Mexico  1819046  2059179
## 33             New York 18976457 19378102
## 34       North Carolina  8049313  9535483
## 35         North Dakota   642200   672591
## 36                 Ohio 11353140 11536504
## 37             Oklahoma  3450654  3751351
## 38               Oregon  3421399  3831074
## 39         Pennsylvania 12281054 12702379
## 40         Rhode Island  1048319  1052567
## 41       South Carolina  4012012  4625364
## 42         South Dakota   754844   814180
## 43            Tennessee  5689283  6346105
## 44                Texas 20851820 25145561
## 45                 Utah  2233169  2763885
## 46              Vermont   608827   625741
## 47             Virginia  7078515  8001024
## 48           Washington  5894121  6724540
## 49        West Virginia  1808344  1852994
## 50            Wisconsin  5363675  5686986
## 51              Wyoming   493782   563626

Quick Combo Plot

To check my data values

ggplot() +
  geom_bar(data=pop2000_2010, aes(x=reorder(NAME,pop10), y=pop10/1000000), stat="identity") + coord_flip() +
  geom_point(data=pop2000_2010, aes(x=reorder(NAME,pop10), y=pop00/1000000, col="red")) + coord_flip()
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

Save the data

Use write.csv to save a data frame to a CSV file.

write.csv(pop2000_2010, file="pop2000_2010.csv", row.names = FALSE)

QUESTIONS?

Part 2. Mapping

map topics

  • how to fetch
  • cb vs tl
  • ggplot vs tmap vs plot
  • components of an sf
  • saving to shapefile
  • crs issues esp. with census data
  • shift_geo option
  • other available data

Mapping Census Data with tidycensus

We can fetch geographic data with tidycensus by adding geometry=TRUE to our function calls.

Under the hood, tidycensus calls the tigris package to fetch data from the Census Geographic Data APIs.

Only a subset of data available via tigris can be accessed via tidycensus.

Geometry Options

Before fetching geometry, we need to specify a few tigris options

  • Set the class of returned data to be sf objects (not sp, the default)
  • Set tigris_use_cache to TRUE
  • Set cache directory (optional)
# Tigris options - used by tidycensus
options(tigris_class = "sf")      # SP is the default format returned by tigris
options(tigris_use_cache = TRUE)  # Save retrieved data locally
#tigris_cache_dir("~/Documents/gis_data/census")  # Folder for local data

Fetch geographic boundary data with tidycensus

We fetch the geospatial data by setting geometry=TRUE.

pop2010geo <- get_decennial(geography = "state", 
                          variables = c(pop10="P001001"), 
                          year = 2010, 
                          output="wide", 
                          geometry=TRUE) # Fetch geometry with the data for mapping
## Getting data from the 2010 decennial Census

Take a look

pop2010geo
## Simple feature collection with 52 features and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -179.1473 ymin: 17.88481 xmax: 179.7785 ymax: 71.35256
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## # A tibble: 52 x 4
##    GEOID NAME                     pop10                           geometry
##    <chr> <chr>                    <dbl>                 <MULTIPOLYGON [°]>
##  1 01    Alabama               4779736. (((-85.00237 31.00068, -85.02411 …
##  2 02    Alaska                 710231. (((-164.9762 54.13459, -164.9378 …
##  3 04    Arizona               6392017. (((-109.0452 36.99908, -109.0452 …
##  4 05    Arkansas              2915918. (((-94.55929 36.4995, -94.51948 3…
##  5 06    California           37253956. (((-122.4463 37.86105, -122.4386 …
##  6 08    Colorado              5029196. (((-102.0422 36.99308, -102.0545 …
##  7 09    Connecticut           3574097. (((-71.85957 41.3224, -71.86823 4…
##  8 10    Delaware               897934. (((-75.55945 39.62981, -75.5591 3…
##  9 11    District of Columbia   601723. (((-77.0386 38.79151, -77.0389 38…
## 10 12    Florida              18801310. (((-85.15641 29.67963, -85.1374 2…
## # ... with 42 more rows

Take a look at the file on our local computer

Sys.getenv('TIGRIS_CACHE_DIR') # check cache dir
## [1] "/Users/pattyf/Documents/gis_data/census"

Geospatial Data in R

R sf objects include

  • geometry of type (MULTI) POINT, LINE, POLYGON
    • in the dataframe column named geometry
  • a dataframe of attributes

  • a CRS (coordinate reference system), specified by
    • epsg(SRID) code
    • proj4string

Map it

We can use plot to make a quick map the geometry…

plot(pop2010geo$geometry)

Question

What do you get if you plot the sf object without specifying “$geometry”

The Challenge of US maps

Almost all of the USA is between -66 and -180 degrees latitude.

But Alaska includes a few islands that are between 180-172 degrees latitude.

This makes it difficult to map the US centered at 0,0 lat/lon.

plot(pop2010geo$geometry)

Fetch geographic data with tidycensus, SHIFTED

tidycensus includes a shift_geo parameter to shift AK & HI to below Texas.

pop2010geo_shifted <- get_decennial(geography = "state", 
                                    variables = c(pop10="P001001"), 
                                    output="wide",
                                    year = 2010, 
                                    geometry=TRUE, 
                                    shift_geo=TRUE)
## Getting data from the 2010 decennial Census
## Using feature geometry obtained from the albersusa package
## Please note: Alaska and Hawaii are being shifted and are not to scale.

Shift Happens!

plot(pop2010geo_shifted$geometry)

Save it

You can save sf data to a shapefile using st_write

st_write(pop2010geo_shifted,"usa_2010_shifted.shp")

Mapping Data Values

plot(pop2010geo_shifted['pop10'])

ggplot2 Maps

ggplot(pop2010geo_shifted, aes(fill = pop10)) + 
  geom_sf()

Fancier

ggplot(pop2010geo_shifted, aes(fill = pop10/1000000, color = pop10/1000000)) + 
  geom_sf() +
  scale_fill_viridis_c() + scale_color_viridis_c(guide = FALSE) +
  theme_minimal() +
  labs(title = "Population by State, 2010", fill="Population (millions)")

Challenge

Create a map of CA Population in 2010 by county

2010 pop Data for California Counties

cal_pop10 <- get_decennial(geography = "county", 
                           variables = "P001001",
                           year = 2010, 
                           state='CA',
                           geometry=TRUE)
## Getting data from the 2010 decennial Census

Map it

plot(cal_pop10['value'])

Fetch County data for more than one state

We can fetch both the census data and geometry for more than one state.

west_pop10 <- get_decennial(geography = "county", 
                           variables =  "P001001",
                           year = 2010, 
                           state=c('CA','OR','NV',"AZ"),
                           geometry=T)
## Getting data from the 2010 decennial Census

Census Tract Data

Fetching the data for all tracts in one state.

Need to specify county!.

# Optional state & county argments to get more limited geo subset
alco_pop10 <- get_decennial(geography = "tract", 
                           variables = "P001001", 
                           year = 2010, 
                           state='CA',
                           county='Alameda',
                           geometry=T)
## Getting data from the 2010 decennial Census

Challenge

Fetch and map the 2010 population by census tract for Alameda and Countra Costa counties.

Mapping Tract data

alcc_pop10 <- get_decennial(geography = "tract", 
                      variables = "P001001", 
                      year = 2010, 
                      state='CA',
                      county=c("Alameda","Contra Costa"),
                      geometry=T) 
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census
plot(alcc_pop10['value'])

Challenge

Fetch and map the percent of San Francicso properties by census tract that were coded as rented in the 2010 Census.

Step 1 - indentify the variables for the

  • total number of hounsing units

  • number of renter occupied units

SF Rented Units, 2010

sf_rented <- get_decennial(geography = "tract",  # census tabulation unit
                           variables =  "H004004",
                           year = 2010, 
                           summary_var = "H004001",  # Total Urban - the denominator
                           state='CA',
                           county='San Francisco',
                           geometry=T)

sf_pct_rented <- sf_rented[sf_rented$value > 0,] %>%
                 mutate(pct = 100 * (value / summary_value))

plot(sf_pct_rented['pct'])

Questions?

Part 3. ACS 5 year data

ACS 5 year

You can use the tidycensus get_acs function to retrieve data for the ACS 5 year products, beginning with the 2005 - 2010 dataset.

The default end year is the most recent one in tidycensus which is 2016 as of Dec 6, 2018.

get_acs syntax and options are very similar to get_decennial.

But there are many more data variables!

Fetch List of ACS 5 year Variables

Let’s start by fetching ACS 5-year 2016 data on poverty.

We want to explore the number of folks living below the poverty level by census tract.

First we need to find the variable name(s)!

Load ACS Table Vars

Load the ACS 2012-2016 5 year data variables into a dataframe.

  • ACS 5 year datasets are referenced by end year in tidycensus!

Then search for the variable names.

How many variables refer to the concept of poverty?

acs2016vars <- load_variables(year=2016, dataset = 'acs5', cache = T)
#View(acs2016vars)

ACS Tables and variables

Many hundreds (thousands?) more than for decennial census!

See the documentation on the census website

Types of tables:

  • B prefix = base tables
  • C = collapsed tables
  • DP = data profiles
  • S = Subject tables

Census Reporter

ACS variables can be confusing.

The Census Reporter website (https://censusreporter.org) provides another tool for navigating topics, tables, and variable names.

Let’s check it out to see what tables/variables we should use.

get_acs

Use the tidycensus get_acs function to fetch the poverty data for census tracts in San Francisco

?get_acs

get_acs in action

sf_poor <- get_acs(geography = "tract",  
                   table = "C17002",     # Table: ratio income to poverty
                   year = 2016,          
                   state="CA",
                   summary_var = "C17002_001", # Est of num people - denom
                   county="San Francisco",
                   geometry=T)               
## Getting data from the 2012-2016 5-year ACS

View output

Let’s take a look at the output of get_acs and discuss how it differs from get_decennial.

 sf_poor
## Simple feature collection with 1576 features and 7 fields (with 8 geometries empty)
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -123.0139 ymin: 37.69274 xmax: -122.3276 ymax: 37.86334
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
##          GEOID                                               NAME
## 1  06075010100 Census Tract 101, San Francisco County, California
## 2  06075010100 Census Tract 101, San Francisco County, California
## 3  06075010100 Census Tract 101, San Francisco County, California
## 4  06075010100 Census Tract 101, San Francisco County, California
## 5  06075010100 Census Tract 101, San Francisco County, California
## 6  06075010100 Census Tract 101, San Francisco County, California
## 7  06075010100 Census Tract 101, San Francisco County, California
## 8  06075010100 Census Tract 101, San Francisco County, California
## 9  06075010200 Census Tract 102, San Francisco County, California
## 10 06075010200 Census Tract 102, San Francisco County, California
##      variable estimate moe summary_est summary_moe
## 1  C17002_001     3972 291        3972         291
## 2  C17002_002      314 196        3972         291
## 3  C17002_003      397 185        3972         291
## 4  C17002_004      120 130        3972         291
## 5  C17002_005      236 129        3972         291
## 6  C17002_006      123 112        3972         291
## 7  C17002_007      444 304        3972         291
## 8  C17002_008     2338 387        3972         291
## 9  C17002_001     4300 442        4300         442
## 10 C17002_002      160 123        4300         442
##                          geometry
## 1  MULTIPOLYGON (((-122.4206 3...
## 2  MULTIPOLYGON (((-122.4206 3...
## 3  MULTIPOLYGON (((-122.4206 3...
## 4  MULTIPOLYGON (((-122.4206 3...
## 5  MULTIPOLYGON (((-122.4206 3...
## 6  MULTIPOLYGON (((-122.4206 3...
## 7  MULTIPOLYGON (((-122.4206 3...
## 8  MULTIPOLYGON (((-122.4206 3...
## 9  MULTIPOLYGON (((-122.4267 3...
## 10 MULTIPOLYGON (((-122.4267 3...

map it

First remove census tracts that have no people!

sf_poor2 <- subset(sf_poor, summary_est > 0)
plot(sf_poor2['estimate'])

Calcuting percents

Let’s calculate the percent below poverty by tract.

sf_poor3 <- sf_poor2 %>%
  mutate(pct = 100 * (estimate / summary_est))

sf_poor3
## Simple feature collection with 1560 features and 8 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -122.5145 ymin: 37.70813 xmax: -122.3276 ymax: 37.86334
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
##          GEOID                                               NAME
## 1  06075010100 Census Tract 101, San Francisco County, California
## 2  06075010100 Census Tract 101, San Francisco County, California
## 3  06075010100 Census Tract 101, San Francisco County, California
## 4  06075010100 Census Tract 101, San Francisco County, California
## 5  06075010100 Census Tract 101, San Francisco County, California
## 6  06075010100 Census Tract 101, San Francisco County, California
## 7  06075010100 Census Tract 101, San Francisco County, California
## 8  06075010100 Census Tract 101, San Francisco County, California
## 9  06075010200 Census Tract 102, San Francisco County, California
## 10 06075010200 Census Tract 102, San Francisco County, California
##      variable estimate moe summary_est summary_moe        pct
## 1  C17002_001     3972 291        3972         291 100.000000
## 2  C17002_002      314 196        3972         291   7.905337
## 3  C17002_003      397 185        3972         291   9.994965
## 4  C17002_004      120 130        3972         291   3.021148
## 5  C17002_005      236 129        3972         291   5.941591
## 6  C17002_006      123 112        3972         291   3.096677
## 7  C17002_007      444 304        3972         291  11.178248
## 8  C17002_008     2338 387        3972         291  58.862034
## 9  C17002_001     4300 442        4300         442 100.000000
## 10 C17002_002      160 123        4300         442   3.720930
##                          geometry
## 1  MULTIPOLYGON (((-122.4206 3...
## 2  MULTIPOLYGON (((-122.4206 3...
## 3  MULTIPOLYGON (((-122.4206 3...
## 4  MULTIPOLYGON (((-122.4206 3...
## 5  MULTIPOLYGON (((-122.4206 3...
## 6  MULTIPOLYGON (((-122.4206 3...
## 7  MULTIPOLYGON (((-122.4206 3...
## 8  MULTIPOLYGON (((-122.4206 3...
## 9  MULTIPOLYGON (((-122.4267 3...
## 10 MULTIPOLYGON (((-122.4267 3...

subset to isolate vars of interest

We want to subset on and then combine the percents of people in each tract who are below 50% and 99% of the poverty line.

sf_poor4 <- subset(sf_poor3, (variable %in% c("C17002_002","C17002_003")))
head(sf_poor4)
## Simple feature collection with 6 features and 8 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -122.4267 ymin: 37.79847 xmax: -122.3996 ymax: 37.81144
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
##          GEOID                                               NAME
## 2  06075010100 Census Tract 101, San Francisco County, California
## 3  06075010100 Census Tract 101, San Francisco County, California
## 10 06075010200 Census Tract 102, San Francisco County, California
## 11 06075010200 Census Tract 102, San Francisco County, California
## 18 06075010300 Census Tract 103, San Francisco County, California
## 19 06075010300 Census Tract 103, San Francisco County, California
##      variable estimate moe summary_est summary_moe      pct
## 2  C17002_002      314 196        3972         291 7.905337
## 3  C17002_003      397 185        3972         291 9.994965
## 10 C17002_002      160 123        4300         442 3.720930
## 11 C17002_003       75  55        4300         442 1.744186
## 18 C17002_002      194 102        4341         426 4.469016
## 19 C17002_003      431 204        4341         426 9.928588
##                          geometry
## 2  MULTIPOLYGON (((-122.4206 3...
## 3  MULTIPOLYGON (((-122.4206 3...
## 10 MULTIPOLYGON (((-122.4267 3...
## 11 MULTIPOLYGON (((-122.4267 3...
## 18 MULTIPOLYGON (((-122.4187 3...
## 19 MULTIPOLYGON (((-122.4187 3...

Group by and sum

sf_poor5 <- sf_poor4 %>%
  select(GEOID, pct, geometry) %>%
  group_by(GEOID) %>% 
  summarise(pct_below_pov = sum(pct))

head(sf_poor5)
## Simple feature collection with 6 features and 2 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -122.4267 ymin: 37.79323 xmax: -122.3897 ymax: 37.81144
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## # A tibble: 6 x 3
##   GEOID       pct_below_pov                                       geometry
##   <chr>               <dbl>                                  <POLYGON [°]>
## 1 06075010100         17.9  ((-122.4206 37.81111, -122.4075 37.81144, -12…
## 2 06075010200          5.47 ((-122.4267 37.80964, -122.4249 37.8108, -122…
## 3 06075010300         14.4  ((-122.4187 37.80593, -122.4169 37.80521, -12…
## 4 06075010400          7.93 ((-122.4149 37.80354, -122.4116 37.80396, -12…
## 5 06075010500          8.53 ((-122.4052 37.80476, -122.4035 37.80509, -12…
## 6 06075010600         24.7  ((-122.411 37.80117, -122.4078 37.80157, -122…

Map it

Where are SF’s poorest areas?

plot(sf_poor5['pct_below_pov'])

Interactive Maps with tmap

tmap

The tmap package is great for making both static and interactive maps. It turns R into a GIS.

Let’s check it out with our last dataframe.

tmap

library(tmap)
## Warning: package 'tmap' was built under R version 3.4.4
tmap_mode("view") # set mode to interactive
## tmap mode set to interactive viewing
poverty_map <- tm_shape(sf_poor5) +
                  tm_polygons(col="pct_below_pov")

tmap

View the map - click on tracts

poverty_map

tmap

There are a number of great tutorials online for working with tmap.

See the References at the end of this workshop document.

Types of Census Geographic Data

Cartographic Boundary data vs Detailed TIGER/Line data

By default, tidycensus downloads census cartographic boundary data.

These are simplifed geometries, clipped to coastlines.

In get_acs you can also request the more detailed census TIGER data.

Let’s check it out.

Fetch Cartographic Boundary Data

sf_poor_cb <- get_acs(geography = "tract",   
                   table = "C17002", 
                   summary_var = "C17002_001",
                   year = 2016,           
                   state="CA",
                   county="San Francisco",
                   geometry=TRUE,
                   cb = TRUE)     # THIS IS THE DEFAULT!
## Getting data from the 2012-2016 5-year ACS

Fetch Detailed TIGER/Line Geometry

sf_poor_tl <- get_acs(geography = "tract",   
                   table = "C17002",        
                   summary_var = "C17002_001",
                   year = 2016,              
                   state="CA",
                   county="San Francisco",
                   geometry=TRUE,
                   cb = FALSE)  # Fetching the TIGER/Line data  
## Getting data from the 2012-2016 5-year ACS

Map it

par(mfrow=c(1,2))
plot(sf_poor_cb[sf_poor$summary_est>0,]$geometry)
plot(sf_poor_tl[sf_poor$summary_est>0,]$geometry)

par(mfrow=c(1,1))

Visualize differences with Tmap

tm_shape(sf_poor_tl) +
    tm_borders() +
tm_shape(sf_poor_cb) +
    tm_borders(col="red")

Challenge

TO BE ADDED….

Questions?

Scaling Up Example

Read in census variables of interest from a file

# Load cenvar lookup table of vars of interest
my_cenvar_df <-read.csv("data/cenvar_lookup.csv", strip.white = T, stringsAsFactors = F)

my_cenvar_df
##              my_cen_var_names my_cen_vars
## 1          citizenship_totpop B05001_001E
## 2     citizenship_non_citizen B05001_006E
## 3                entry_totpop B05005_001E
## 4                  entry_2010 B05005_002E
## 5             entry_2000_2009 B05005_007E
## 6           birthplace_totpop B05007_001E
## 7            birthplace_europ B05007_014E
## 8            birthplace_asian B05007_027E
## 9     birthplace_latinAmerica B05007_040E
## 10    birthplace_southAmerica B05007_081E
## 11    birthplace_other_nonUSA B05007_094E
## 12    birthplace_byage_totpop B06001_001E
## 13     birthplace_byage_fborn B06001_049E
## 14             poverty_totpop B06012_001E
## 15                  below_pov B06012_002E
## 16                 below_pov2 B06012_003E
## 17       poverty_fborn_totpop B06012_017E
## 18            below_pov_fborn B06012_018E
## 19           below_pov2_fborn B06012_019E
## 20       health_native_totpop B27020_002E
## 21  health_native_noinsurance B27020_006E
## 22    health_fborn_nat_totpop B27020_008E
## 23 fborn_nohealth_naturalized B27020_012E
## 24 health_fborn_noncit_totpop B27020_013E
## 25  fborn_nohealth_noncitizen B27020_017E

Fetch the ACS data

Fetch the ACS data for these variables for the 9 county bay area

nine_counties <- c("001", "075", "013", "041", "055", "081", "085", "095", "097", "087")
bay9_data <-get_acs(geography = "tract", 
                       variables = my_cenvar_df$my_cen_vars, 
                       year=2016,
                       state = "CA", 
                       county = nine_counties, 
                       geometry = T)
## Getting data from the 2012-2016 5-year ACS
bay9_data
## Simple feature collection with 41025 features and 5 fields (with 175 geometries empty)
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -123.5335 ymin: 36.85065 xmax: -121.2082 ymax: 38.86424
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
##          GEOID                                          NAME   variable
## 1  06001400100 Census Tract 4001, Alameda County, California B05001_001
## 2  06001400100 Census Tract 4001, Alameda County, California B05001_006
## 3  06001400100 Census Tract 4001, Alameda County, California B05005_001
## 4  06001400100 Census Tract 4001, Alameda County, California B05005_002
## 5  06001400100 Census Tract 4001, Alameda County, California B05005_007
## 6  06001400100 Census Tract 4001, Alameda County, California B05007_001
## 7  06001400100 Census Tract 4001, Alameda County, California B05007_014
## 8  06001400100 Census Tract 4001, Alameda County, California B05007_027
## 9  06001400100 Census Tract 4001, Alameda County, California B05007_040
## 10 06001400100 Census Tract 4001, Alameda County, California B05007_081
##    estimate moe                       geometry
## 1      3018 195 MULTIPOLYGON (((-122.2469 3...
## 2       218 106 MULTIPOLYGON (((-122.2469 3...
## 3       944 168 MULTIPOLYGON (((-122.2469 3...
## 4        61  50 MULTIPOLYGON (((-122.2469 3...
## 5       176  99 MULTIPOLYGON (((-122.2469 3...
## 6       843 156 MULTIPOLYGON (((-122.2469 3...
## 7       208  75 MULTIPOLYGON (((-122.2469 3...
## 8       546 141 MULTIPOLYGON (((-122.2469 3...
## 9        46  43 MULTIPOLYGON (((-122.2469 3...
## 10       11  17 MULTIPOLYGON (((-122.2469 3...

Reformat Ouput

  1. we only keep the columns of interest - including GEOID and geometry
  2. each estimate variable is in its own column
bay9_data2 <- bay9_data %>%
  select("GEOID", "variable", "estimate") %>%
  spread(key=variable, value=estimate)

bay9_data2
## Simple feature collection with 1641 features and 26 fields (with 7 geometries empty)
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -123.5335 ymin: 36.85065 xmax: -121.2082 ymax: 38.86424
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
##          GEOID B05001_001 B05001_006 B05005_001 B05005_002 B05005_007
## 1  06001400100       3018        218        944         61        176
## 2  06001400200       1960         92        315         33         55
## 3  06001400300       5236        317        900        130        314
## 4  06001400400       4171        190        524         69        115
## 5  06001400500       3748        430        701        226        123
## 6  06001400600       1661         62        205         21          9
## 7  06001400700       4552        353        619        122        177
## 8  06001400800       3506        276        767        108        185
## 9  06001400900       2262        202        446         20         16
## 10 06001401000       6193        477        754         90        186
##    B05007_001 B05007_014 B05007_027 B05007_040 B05007_081 B05007_094
## 1         843        208        546         46         11         43
## 2         243         65        136         28         19         14
## 3         857        119         90        257         86        391
## 4         471        146        228         35         15         62
## 5         635        160         83        245         37        147
## 6         178         49         74         38          0         17
## 7         587         70        145        299         67         73
## 8         741        181        330        149          8         81
## 9         405         20        136        148         38        101
## 10        671         56         46        519         67         50
##    B06001_001 B06001_049 B06012_001 B06012_002 B06012_003 B06012_017
## 1        3018        843       3011        113         20        843
## 2        1960        243       1952        106         84        243
## 3        5236        857       5153        450        217        848
## 4        4171        471       4158        268        198        471
## 5        3748        635       3733        339        240        635
## 6        1661        178       1656        158        229        178
## 7        4552        587       4552        820        723        587
## 8        3506        741       3457        381        348        741
## 9        2262        405       2228        358        268        405
## 10       6193        671       6184       1466        768        671
##    B06012_018 B06012_019 B27020_002 B27020_006 B27020_008 B27020_012
## 1          31         12       2175         88        625         12
## 2          31         22       1717         38        151          0
## 3         124        183       4379        111        540         34
## 4          52         82       3694        152        281          0
## 5         151         58       3113        243        205          6
## 6          59         21       1483        143        116          8
## 7         107        103       3965        512        234         33
## 8          75        191       2759        184        465         99
## 9          66         39       1857        103        203         15
## 10        120         17       5513        463        194          0
##    B27020_013 B27020_017                       geometry
## 1         218         10 MULTIPOLYGON (((-122.2469 3...
## 2          92         35 MULTIPOLYGON (((-122.2574 3...
## 3         317         15 MULTIPOLYGON (((-122.2642 3...
## 4         190         21 MULTIPOLYGON (((-122.2618 3...
## 5         430        148 MULTIPOLYGON (((-122.2694 3...
## 6          62          0 MULTIPOLYGON (((-122.2681 3...
## 7         353        147 MULTIPOLYGON (((-122.2779 3...
## 8         276         79 MULTIPOLYGON (((-122.2887 3...
## 9         202         28 MULTIPOLYGON (((-122.2856 3...
## 10        477        111 MULTIPOLYGON (((-122.2787 3...

Rename the columns

Use our dataframe of census variables to rename the columns so that they are self-describing.

colnames(bay9_data2)<-c("GEOID", my_cenvar_df$my_cen_var_names, "geometry")

bay9_data2
## Simple feature collection with 1641 features and 26 fields (with 7 geometries empty)
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -123.5335 ymin: 36.85065 xmax: -121.2082 ymax: 38.86424
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
##          GEOID citizenship_totpop citizenship_non_citizen entry_totpop
## 1  06001400100               3018                     218          944
## 2  06001400200               1960                      92          315
## 3  06001400300               5236                     317          900
## 4  06001400400               4171                     190          524
## 5  06001400500               3748                     430          701
## 6  06001400600               1661                      62          205
## 7  06001400700               4552                     353          619
## 8  06001400800               3506                     276          767
## 9  06001400900               2262                     202          446
## 10 06001401000               6193                     477          754
##    entry_2010 entry_2000_2009 birthplace_totpop birthplace_europ
## 1          61             176               843              208
## 2          33              55               243               65
## 3         130             314               857              119
## 4          69             115               471              146
## 5         226             123               635              160
## 6          21               9               178               49
## 7         122             177               587               70
## 8         108             185               741              181
## 9          20              16               405               20
## 10         90             186               671               56
##    birthplace_asian birthplace_latinAmerica birthplace_southAmerica
## 1               546                      46                      11
## 2               136                      28                      19
## 3                90                     257                      86
## 4               228                      35                      15
## 5                83                     245                      37
## 6                74                      38                       0
## 7               145                     299                      67
## 8               330                     149                       8
## 9               136                     148                      38
## 10               46                     519                      67
##    birthplace_other_nonUSA birthplace_byage_totpop birthplace_byage_fborn
## 1                       43                    3018                    843
## 2                       14                    1960                    243
## 3                      391                    5236                    857
## 4                       62                    4171                    471
## 5                      147                    3748                    635
## 6                       17                    1661                    178
## 7                       73                    4552                    587
## 8                       81                    3506                    741
## 9                      101                    2262                    405
## 10                      50                    6193                    671
##    poverty_totpop below_pov below_pov2 poverty_fborn_totpop
## 1            3011       113         20                  843
## 2            1952       106         84                  243
## 3            5153       450        217                  848
## 4            4158       268        198                  471
## 5            3733       339        240                  635
## 6            1656       158        229                  178
## 7            4552       820        723                  587
## 8            3457       381        348                  741
## 9            2228       358        268                  405
## 10           6184      1466        768                  671
##    below_pov_fborn below_pov2_fborn health_native_totpop
## 1               31               12                 2175
## 2               31               22                 1717
## 3              124              183                 4379
## 4               52               82                 3694
## 5              151               58                 3113
## 6               59               21                 1483
## 7              107              103                 3965
## 8               75              191                 2759
## 9               66               39                 1857
## 10             120               17                 5513
##    health_native_noinsurance health_fborn_nat_totpop
## 1                         88                     625
## 2                         38                     151
## 3                        111                     540
## 4                        152                     281
## 5                        243                     205
## 6                        143                     116
## 7                        512                     234
## 8                        184                     465
## 9                        103                     203
## 10                       463                     194
##    fborn_nohealth_naturalized health_fborn_noncit_totpop
## 1                          12                        218
## 2                           0                         92
## 3                          34                        317
## 4                           0                        190
## 5                           6                        430
## 6                           8                         62
## 7                          33                        353
## 8                          99                        276
## 9                          15                        202
## 10                          0                        477
##    fborn_nohealth_noncitizen                       geometry
## 1                         10 MULTIPOLYGON (((-122.2469 3...
## 2                         35 MULTIPOLYGON (((-122.2574 3...
## 3                         15 MULTIPOLYGON (((-122.2642 3...
## 4                         21 MULTIPOLYGON (((-122.2618 3...
## 5                        148 MULTIPOLYGON (((-122.2694 3...
## 6                          0 MULTIPOLYGON (((-122.2681 3...
## 7                        147 MULTIPOLYGON (((-122.2779 3...
## 8                         79 MULTIPOLYGON (((-122.2887 3...
## 9                         28 MULTIPOLYGON (((-122.2856 3...
## 10                       111 MULTIPOLYGON (((-122.2787 3...

Questions?

Next steps

ACS 5 Year

Latest is 2012 - 2016

2013 - 2017 release date is dec 8, 2018

See: https://www.census.gov/programs-surveys/acs/news/data-releases/2017/release-schedule.html

See when tidycensus package is updated….!!!

Extras for Enthusiasts

Fetching data for multiple years

Requires variable name to be the same!

# use purr::map_df to get data for multiple years (must have same vars!)
pop90_10 <- map_df(c(1990, 2000, 2010), function(x) { 
  get_decennial(geography = "state",
  variables = c(totalpop = "P001001"),
  dataset = "sf1",
  year = x) %>%
  mutate(year = x) }
)

# View output
head(pop90_10)
tail(pop90_10)

# Plot it
pop90_10 %>% ggplot(aes(x=reorder(NAME,value), y=value/1000000, fill=factor(year))) + 
             geom_bar(stat="identity", position=position_dodge()) + coord_flip()

Combining Census Data with Other Data

Area Weighted Interpolation

One of the strenghts of the sf package is how relatively easy it is to reaggregate data from one geometry to another. This process is called areal interpolation.

Area weighted interpolation reaggregates the data based on the percent of area shared by input and output geometeries.

Read in a Shapefile

sfnhoods<- st_read("data/sfnhoods.shp")
head(sfnhoods)
plot(sfnhoods['nhood'])

Check the CRS

st_crs(sfnhoods)
st_crs(sf_poor5)

CRS transformation

sf_poor5_4326 = st_transform(sf_poor5, st_crs(sfnhoods))

Area Weighted Interpolation

Reaggregate percent of people below poverty from census tract to neighborhood polygons.

sfhoods2 = st_interpolate_aw(sf_poor5_4326[, "pct_below_pov"], sfnhoods,
extensive = F) # True= aw sum; False= aw avg

Map it

par(mfrow=c(1,2))
plot(sf_poor5['pct_below_pov'])
plot(sfhoods2['pct_below_pov'])
par(mfrow=c(1,1))

Map it with tmap

tm_shape(sfhoods2) +
   tm_polygons(col="pct_below_pov")

Combine the values

head(sfhoods2)
sfnhoods$pct_below_pov <- sfhoods2$pct_below_pov

# map again - click on polygons and view data in popups
# to confirm the AWI output values
tm_shape(sfnhoods) +
  tm_polygons(col="pct_below_pov", 
    popup.vars = c("nhood", "pct_below_pov")
  )